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Abstract 


We present a gridded 8 km-resolution data product of the estimated composition of tree taxa 
at the time of Euro-American settlement of the northeastern United States and the statistical 
methodology used to produce the product from trees recorded by land surveyors. Composition 
is defined as the proportion of stems larger than approximately 20 cm diameter at breast height 
for 22 tree taxa, generally at the genus level. The data come from settlement-era public sur¬ 
vey records that are transcribed and then aggregated spatially, giving count data. The domain 
is divided into two regions, eastern (Maine to Ohio) and midwestern (Indiana to Minnesota). 
Public Land Survey point data in the midwestern region (ca. 0.8-km resolution) are aggregated 
to a regular 8 km grid, while data in the eastern region, from Town Proprietor Surveys, are 
aggregated at the township level in irregularly-shaped local administrative units. The prod¬ 
uct is based on a Bayesian statistical model fit to the count data that estimates composition 
on a regular 8 km grid across the entire domain. The statistical model is designed to handle 
data from both the regular grid and the irregularly-shaped townships and allows us to estimate 
composition at locations with no data and to smooth over noise caused by limited counts in 
locations with data. Critically, the model also allows us to quantify uncertainty in our com¬ 
position estimates, making the product suitable for applications employing data assimilation. 

We expect this data product to be useful for understanding the state of vegetation in the north¬ 
eastern United States prior to large-scale Euro-American settlement. In addition to specific 
regional quesfions, fhe dafa producf can also serve as a baseline againsf which fo invesfigafe 
how foresfs and ecosysfems change affer infensive sefflemenf. The dafa producf is being made 
available af fhe NIS dafa porfal as version 1.0. 

Keywords: biogeography, species composifion, old-growfh foresfs, spafial modeling, Bayesian 
sfafisfical model, vegefafion mapping 


1 Introduction 


Historical datasets provide critical context to understand forest eeology. They allow researchers 
to define ‘baseline’ conditions for conservation management, to understand ecosystem proeesses 
at decadal and centennial seales, to traek forest responses to shifting climates, and, particularly in 
regions with widespread land use change, to understand the extent to which forests after conversion 
and regeneration differ from the original forest eover. 

Euro-Ameriean settlement and subsequent land use ehange oeeurred in a time-transient fash¬ 
ion aeross North Ameriea and were aeeompanied by land surveys needed to demareate land for 
land tenure and use. Various systems were used by surveyors to loeate legal boundary markers, 
usually by reeording and marking trees adjaeent to survey markers. These data provide vegetation 
information that ean be mapped and used quantitatively to represent the period of settlement. Early 
surveys (from 1620 u ntil 1825) in the northeastern United States pr ovide spatially-aggregated data 
at the township level ( Cogbill et ah . 20021: Thompson et ah . 2013 ). with typical township size on 
the order of 200 km^ and no information about the loeations of individual trees; we refer to these 
as the Town Proprietor Survey (TPS). Eater surveys after the establishment of the U.S. Publie Eand 
Survey System (PES) by the General Eand Office (GEO) provide point-level data along a regular 
grid, with one-half mile (800 m) spaeing, for Ohio and westward during the perio d 1785 to 1907 
( Bourdo . 19561: PattisonL 1957 : Sehulte and Mladenoff . 200 ll: Goring et ahl 20161) . At eaeh point 
2-4 trees were identified, and the eommon name, diameter at breast height, and distanee and bear- 


2 


































ing from the point were reeorded. Survey instruetions during the PLS varied through time and by 
point type. Aeeounting for this variation requires data sereenin g to maximize eonsi steney among 
points and the applieation of spatially-varying eorreetion faetors Goring et"^ ( 2016h to aeeurately 
assess tree stem density, basal area and biomass from the early settlement reeords, but the impaet 
on eomposition estimates is limited ( Liu et ah . 201 lb . Surveyors sometimes used ambiguous eom- 
mon names, whieh requi res matehing to seientifie names and standardization (IMladenoff et ah . 
2002 : Goring et"^ 20161) . 

Logging, agrieulture, a nd land abandonment have left an indelible mark on forests in the 


northeastern United States (IFoster et al.L 1 19981: iRhemtulla et al.L l2009bl: [Thompson et al.L 12013 : 


Goring et"^ 2016). However most studies have assessed t hese ef feets in individual states or 


smaller domains (IFriedman and Reiehl.l2005l: IRhemtulla et al.Ll2009al) and with various spatial res- 
olutio ns, from townships (36 square miles) to forest zones of hundreds or thousands of square 
miles. Goring et al. ( 20161) provide a new dataset of forest eomposition, biomass, and stem density 
based on PLS data for the upper Midwest that is resolved to an 8 km by 8 km grid eell seale, pro¬ 
viding broad spatial eoverage at a spatial seale that c an be compared to modern forests using Forest 
Inventory and Analysis products ( Gray et al. . 20121) . Combined with additional, coarsely-sampled 
PLS data from Illinois and Indiana, newly-digitized data from southern Michigan, and with the 
TPS data, this gives us raw data for much of the northeastern United States. However, there are 
several limitations of using the raw data that can be alleviated by the use of a statistical model to 
develop a statistically-estimated data product. First, the PLS and TPS data only provide estimates 
of within-cell variance that do not account for information from nearby locations. Second, there 
are data gaps: the available digitized data from Illinois and Indiana represent a small fraction of 
those states, and missing townships are common in the TPS data. Third, the TPS and PLS data 
have fundamentally different sampling design and spatial resolution. Our statistical model allows 
us to provide a spatially-complete data product of settlement-era tree composition for a common 8 
km grid with uncertainty across the northeastern U.S. 

In Section 12 we describe the data sources, while Section [2 describes our statistical models. 
In Section 0] we quantitatively compare competing statistical specifications, and in Section |2 we 
describe the final data product. In Section [2 we discuss the uncertainties estimated by and the 
limitations of the statistical model, and we list related data products under development. 


2 Data 

The raw data were obtained from land division survey records collated and digitized from across 
the northeastern U.S. by a number of researchers (Fig. [T])- For the states of Minnesota, Wisconsin, 
Illinois, Indiana, and Michigan (the midwestem subdomain), digitized data are available at PLS 
survey point locations and have been aggregated to a regular 8 km grid in the Albers projection. 
(Note that for Indiana and Illinois, at the moment trees are associated with township centroids and 
then assigned to 8 km grid cells based on the centroid, but in the near future we will have point 
locations available for each tree.) For the states of Ohio, Pennsylvania, New Jersey, New York 
and the six New England states (the eastern subdomain), data are aggregated at the township level. 
We make predictions for all of the states listed above; these constitute our core domain. There 
are also data from a single township in Quebec and a single township in northern Delaware; these 
data help inform predictions in nearby locations within our core domain, but predictions are not 
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Figure 1: Spatial domain of the northeastern United States, with loeations with data shown in gray. 
Loeations are grid cells in midwestern portion and townships in eastern portion. In addition to 
locations without data being indicated in white, grid cells completely covered in water are white 
(e.g., a few locations in the northwestern portion of the domain in the states of Minnesota and 
Wisconsin). 


made for Quebec or Delaware. Digitization of PLS data in Minnesota, Wisconsin and Michigan is 
essentially complete, with PLS data for nearly all 8 km grid cells, but data in Illinois and Indiana 
represent a sample of the full set of grid cells, with survey record transcription ongoing. Data for 
the eastern states are available for a subset of the full set of townships cov ering the domain; th e 
TPS data for some townships were lost, incomplete, or have not been located dCogbill et al.L 120021) . 

Note that surveys occurred over a period of more than 200 years as European colonists (before 
U.S. independence) and the United States settled what is now the northeastern and midwestern 
United States. Our estimates are for the period of settlement represented by the survey data and 
therefore are time-transgressive; they do not represent any single point in time across the domain, 
but rather the state of the landscape at the time just p rior to widespread Euro-American settle¬ 
ment and land use ( Whitney . 19961: Cogbill et al.l 2002h . These forest composition datasets do in- 
clude the effects o f Native American land use and early Euro-American settlement activities (e.g.. 
Black et ^ 2006h . but it is li kely that the imprin t of this earlier land use is highly concentrated 


rather than spatially extensive (IMunoz et al .1.120 141) . 

Extensive details o n the upper Midwes t (Minnesota, Wisconsin, Michigan) data and process¬ 
ing steps are available Goring et al. ( 2016h : key elements include the use of only corner points, 
the use of only the two closest trees at each corner point, spatially-varying correction factors 
for sampling effort, and a standardized taxonomy table. The lower Midwest (Illinois, Indiana) 
data were purchased from the Indiana State Archives (Indiana) and Hubtack Document Resources 
(hubtack.com; Illinois) and processed using similar steps as for the upper Midwest data. Digiti¬ 
zation of the Illinois and Indiana data is still underway, so many grid cells contained no data at 
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the time the statistieal model was fit. Note that the number of trees per grid cell varies depend¬ 
ing on the number of survey points in a cell, with an average of 124 trees per cell. The gridded 
data at the 8 km resolution for the rnidwest subdomain are available through the NIS data portal 
(IGoring and University of Wisconsinll2016h . The TPS data were compiled by C.V. Cogbill from a 
myriad of archival sources representing land division su rveys conducted in connection with local 
settlement and are available through the NIS data portal ( Cogbill . 2016al 9l. 

The aggregation into taxonomic groups is primarily at the genus level but is at the species level 
in some cases of monospecific genera. We model the following 22 taxa plus an “other hardwood” 
category: Atlantic white cedar {Chamaecyparis thyoides). Ash (Fraxinus spp.). Basswood {Tilia 
americana). Beech {Fagus grandifolia), Birch {Betula spp.). Black gum/sweet gum {Nyssa syl- 
vatica and Liquidambar styraciflua). Cedar/juniper (Juniperus virginiana and Thuja occidentalis). 
Cherry {Prunus spp.). Chestnut (Castanea dentata). Dogwood (Cornus spp.). Elm (Ulmus spp.). 
Fir (Abies balsamea). Hemlock (Tsuga canadensis). Hickory (Carya spp.), Ironwood (Carpinus 
caroliniana and Ostrya virginiana). Maple (Acer spp.). Oak (Quercus spp.). Pine (Pinus spp.). 
Poplar/tulip poplar (Populus spp. and Liriodendron tulipifera). Spruce (Picea spp.). Tamarack 
(Larix laricina), and Walnut (Juglans spp.). Note that in several cases (black gum/sweet gum, 
ironwood, poplar/tulip poplar, cedar/juniper), because of ambiguity in the common tree names 
used by surveyors, a group represents trees from different genera or even families. For the mid- 
western subdomain we do not fit statistical models for Atlantic white cedar and chestnut as these 
species have 0 and 7 trees present, respectively. The taxa grouped into the other hardwood category 
are those for which fewer than roughly 2000 trees were present in the dataset; however, we include 
Atlantic white cedar explicitly despite it only having 336 trees in the dataset because of specific 
ecological interest in Atlantic white cedar wetlands. 

Diameters are only recorded in the PFS data. Although surveyors avoided using small trees, 
there was no consistent lower diameter limit. The PFS data generally represent trees greater than 8 
inches (~20 cm) diameter at breast height (dbh), but with some trees as small as 1 inch dbh (smaller 
trees were much more common in far northern Minnesota). TPS data have no information about 
dbh, but the trees were large enough to blaze and are presumed to be relatively large trees useful 
for marking property boundaries. 

There are approximately 860,000 trees in the midwestern subdomain and 420,000 trees in the 
eastern subdomain. In the midwestern subdomain, oak is the most common taxon and pine the 
second most common, while in the eastern subdomain oak is the most common and beech the 
second most common. 

Our domain is a rectangle covering all of the states using a metric Albers (Creat Fakes and St. 
Fawrence) projection (PROJ4: EPSC:3175), with the rectangle split into 8 km cells, arranged in a 
296 by 180 grid of cells, with the centroid of the cell in the southwest corner located at (-71000 m, 
58000 m). For the midwestern subdomain we use the western-most 146 by 180 grid of cells when 
fitting the statistical models. For the eastern subdomain we use the eastern-most 180 by 180 grid 
of cells and then omit 23 rows of cells in the north and 17 rows of cells in the south, as these grid 
cells are outside of the states containing data. 


3 Statistical model 

We fit a Bayesian statistical model to the data, with two primary goals: 
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1. To estimate composition on a regular grid across the entire domain, filling gaps where no 
data are available, and 

2. To quantify uncertainty in composition at all locations. Even in grid cells and townships with 
data, we wish to quantify uncertainty because the empirical proportions represent estimates 
of the true proportions that could be calculated using the full population of all the trees in a 
grid cell or township. 

At a high level, the Bayesian statistical model estimates composition across the domain, even in 
locations with sparse or no data, by combining the raw composition data with the assumption that 
composition varies in a smooth spatial fashion across the domain. The information in the data is 
quantified by the data model, also known as the likelihood. The assumption of smoothness is built 
into the model by representing the true unknown spatially-varying composition using a statistical 
spatial process representation that induces smoothing of estimates across nearby locations. This 
spatial process representation is a form of prior distribution and is a function of model param¬ 
eters called hyperparameters that determine the correlation structure of the process and are also 
estimated based on the data. 

The result of fitting the Bayesian model via Markov chain Monte Carlo (MCMC) is a set of 
representative samples from the posterior distribution for the composition in the 23 taxonomic 
groupings at each of the grid cells. These samples are the data product (described further in the 
Section[5]) and can be used in subsequent analyses. The mean and standard deviation of the samples 
for each pair of cell and taxon represent our best estimate (i.e., prediction) of composition and a 
Bayesian “standard error” quantifying the uncertainty in the estimate. 

In the remainder of this section we provide the technical specification of the model and of the 
computations involved in fitting the model. 


3.1 Data model 

We start by describing the basic model for those states for which we have raw data on the 8 km 
grid, and in Section 1331 we describe the extension of the model to accommodate data aggregated 
at the township level. 

The statistical model treats the observations as coming from a multinomial distribution with a 
(latent) vector of proportions for each grid cell, 

Vi ~ Multi(ni, 6»(si)), 


where yi is the vector of counts for the P taxa at the ith cell, rii is the number of trees counted in 
the cell, and 9{si) is the vector of unknown proportions for those taxa at that cell. Note that we use 
a standard multinomial distribution without overdispersion, because the set of t rees in the dataset 
is roughly uniformly sampled across the cells or townships (IGoring et al.Ll2016t) . 

The proportions, Op{si), p = 1,..., P, are modeled spatially by a set of P Gaussian spatial 
processes, one per taxon, ap{si), p = 1,... ,P. This collection of processes defines a mul¬ 
tivariate spatial process for composition. The Q!p(s) processes are defined on the 8 km grid, 
ctp = {ap(si),..., ap{sm)} for the m grid cells. In Section [T4l we introduce a multinomial probit 
model that relates the ap{s) processes to the proportion processes, Op{s), via the introduction of 
latent variables, with an implicit sum-to-one constraint, J2p=i^p(^) = 1- A multinomial probit 


6 









model is similar to logistic regression, used for modeling a binary outcome based on an underlying 
probability the outcome will occur, but generalizes to modeling a categorical variable based on 
probabilities for each category. 

The critical component of the statistical model is the representation of ap(s) as a spatial pro¬ 
cess. A spatial process is a statistical representation that models spatially-correlated values. It 
provides a prior structure that serves to smooth across noise in the observations and allows for 
prediction at locations based on information from nearby locations, including interpolation to loca¬ 
tions with no data. Apart from the sum-to-one constraint, the taxa are considered to be independent 
in the prior. We did not want to impose any structure that ties the different taxa together, as any 
correlation will likely vary across space. 

In the next section, we consider two spatial m odels to define t he str ucture of the 0 ^( 5 ) pro¬ 


cesses, a standard conditional autoregressive model (IBanerjee et al.Ll2004t) and a G aussian Markov 


rando m field (MRF) approximation to a Gaussian process with Matem covariance ( Lindgren et al.l 


201 Ih . These models are specific statistical formulations of spatial processes that represent spatial 


correlation by defining neighborhoods around each location that are used to help inform predictions 
at the location. 


3.2 Spatial process models 

MRF models represent the neighborhood information by working directly with the precision matrix 
(the inverse of the covariance matrix) of the values of the spat ial process, so calculation of the prior 
density of ap is computationally simple ( Rue and Held . 2005b . However, in situations in which the 
likelihood is not normal, such as our multinomial likelihood, it can be difficult to set up effective 
MCMC algorithms that are able to move in the high-dimensional space of ap. The latent variable 
representation of Section 13.41 helps to alleviate this problem. Next we describe two alternative 
spatial models that we considered; in Section S we evaluate the models on held-out data to choose 
between the two. 


Standard conditional autoregressive models Our fir st model is a standard conditional autore¬ 
gressive (CAR) model; technical details can be found in ( Banerjee et al.l 2004 ). We use a standard 
form of this model that treats the four cardinal neighbors of each grid cell as the neighbors of the 
grid cell. The corresponding precision matrix has diagonal elements, Qa, equal to the number 
of neighbors for the i\h area (i.e., four except for cells on the boundary of the domain), while 
Qik = — 1 (the negative of a weight of one) when areas i and k are neighbors and Qik = 0 when 
they are not. This gives the following model for the values of ap{si) collected as a vector across 
all of the grid cells, i = 1,... ,m: 


ap ~ 


N(0,a2g-^ 


The use of the generalized inverse notation indicates that Q is not full-rank, but is of rank m — 1; 
this gives an improper prior on an implicit overall mean for the process values. Note that we 
specify an explicit mean of zero because a non-zero mean would not be identifiable in light of 
the implicit mean. This specification is called an intrinsic conditional autoregression (ICAR) and 
we can write Q = D — C where C is the m x m adjacency matrix defining the neighborhood 
relation of the locations; that is Cik = 1 if locations i and k are neighbors and zero otherwise. The 
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matrix D is an m x m diagonal matrix containing the row sums of matrix C as the diagonal entries, 

m 



k=\ 

We refer to this as the CAR model. 


Gaussian process approximati on Gaussian proeesses (GP) are also standard models for spatial 
proeesses dBanerjee et all 2004 ). GP models are eomputationally ehallenging for larg e datasets be 


cause of computational manipulations involving large covariance matrices. Given this. lLindgren et al.l 
( 201 lb proposed a new framework for using Gaussian MRFs (GMRFs) as approximations to GPs, 
based on the use of stochastic partial differential equations (SPDEs). 

Gaussian processes are generally constructed using one of a number of correlation functions 
that define how the strength of correlation between the values of the process at two locations 
decays as a function of the distance between the locations. We consider Gaussian processes in 
the commonly-used Matem class, using the following parameterization of the Matern correlation 
function, 

1 


R{d) = 


\ p J \ P J 


where d is Euclidean distance, p is the spatial range parameter, and /Cy(-) is the modified Bessel 
function of the second kind, whose order is the smoothness (differentiability) parameter, z/ > 0. 
o = 0.5 gives the exponential covariance. Eor any pair of locations, R{d) defines the correlation 
of the process, (i.e., ap{s) in our context), as a function of the distance between the locations. 
Considering all pairs of locations, thi s defin es a correlation matrix for all locations of interest. 

The approach of lEindgren et al.l (1201 lb allows us to consider MRE approximations to the 
Matern -based GP for u = 1 and u = 2. Our second spatial model is this Eindgren approxi¬ 
mation for Matem -based GPs with u = 1. To implement this Eindgren model, one modifies the 
Q matrix define d previously as follow s based on the technical specification of the precision ma¬ 
trix provided in lEindgren et al.l (1201 ih . Eet a = 4 -f The diagonal elements of Q are 4 -f a^. 
The entries corresponding to cardinal neighbors are —2a. Those for diagonal neighbors are 2, and 
those for 2nd-order cardinal neighbors are 1. This extends the neighborhood structure relative to 
the CAR model and parameterizes it as a function of p. 

The primary difference between the CAR and Eindgren models is that the Eindgren model 
provides an additional degree of freedom by estimating p. In particular p allows us to estimate the 
locality of the spatial smoothing. As p decreases, the model uses increasingly localized data to 
estimate the compositional proportions at a given loca tion, effectively averag ing the empirical pro¬ 
portions over smaller neighborhoods. In general, the lOndgren et al.l (1201 ih model will generally 
provide for a smoother estimate than the CAR model ( PacioreJ 2013 ). 

To ensure that the parameter is mathematically equivalent between the two models, we 
reparameterize, producing our second model: 




f 2 ^TT 

N I /ip, CTp • —Q{pp) 

V Pp 


-1 


We refer to this model as the SPDE model. 
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3.3 Prior Distributions 


The IC AR specification contains a set of hyperparameters {(Tp},p= 1,.. .P. Following iGelman 
(l2006h we use a uniform distribution on each cXp parameter, with upper bound of 1000. For the 
SPDE model we also have hyperparameters {/ip}, which we give flat, non-informative priors (trun¬ 
cated at ±10), and {pp}, which we give uniform priors on the interval (0.1, exp(5)). These various 
hyperparameters are unknown parameters that control the spatial structure of the two spatial mod¬ 
els and are estimated from both the data and the prior distributions just specified based on the 
Bayesian approach. 


3.4 Latent Variable Model 


It is well-known that devising an effective MCMC algorithm for models with latent Gaussian pro- 
cess(es) and a non- Gaussian likelihood is difficult (IRue and Heidi. l2005l: IChristensen et al.L l2006l: 
Tan and Notti 1201311 . To devel op an algorithm, we make use of a latent variable representation for 
the multinomial probit model ( McCulloch and Rossil 1994 ). The representation introduces latent 
variables that allow one to develop an MCMC sampling strategy that takes advantage of closed- 
form full conditional distributions (so-called Gibbs sampling steps) for ctp. 

Suppose that compositional counts are available at a number of locations. At location i, a 
sample of size n* observations is collected, and each observation (i.e., each tree) can be classified 
into P distinct categories. For a given tree j at location i, let Yij denote the response variable 
indicating the category. Let Y^j be associated with P latent variables Wiji ,..., Wijp such that Y^j = 
p if and only if W^jp = max (fkijp'}; in other words, the maximum of the set of latent variables 

p' 

{^ijp}p=i determines the category of observation j at location i. The final piece of the latent 
variable representation is the relationship between the W variables and the Q!p(s) processes. We 
have that 


Wijp ~ 


N(ap(sj), 1) 


independently for all of the Wijp values. Consider the following example with two locations that 
are neighbors and P = 2 categories. Each tree j at location i is associated with two variables 
Wiji and Wij 2 , governed by the latent variables Q!i(si) and 02 ( 5 *), respectively. Suppose that 
Q!i(sj) > a 2 {si) for a given location i. Then this model implies that any tree j is more likely to 
be labeled 1 than 2 at location i. The difference between Q!i(sj) and 0 ^ 2 (s*) explains the difference 
in probability of categories 1 and 2 at location i, and the similarity between ap(si) and Q!p(s 2 ) 
explains the correlation between the probabilities at locations 1 and 2 for category p. 


3.5 Model for township data 

We developed an extension of the model described in previous sections to account for data at a dif¬ 
ferent aggregation than our core 8 km grid. This extension introduces a new set of latent variables, 
one per tree, that indicate the grid cells in which the trees are located and that can be sampled 
within the MCMC as additional unknown parameters. Specifically, ctj is the latent “membership” 
variable for tree j in township t,t = 1,..., T. The prior for ctj is a discrete distribution that puts 
mass, //’ti, i = 1 ,..., m, proportional to the areal overlap between the township in which the tree 
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is located and the m grid eells, giving 

ctj ~ Multinom(l, {tpn: • • •, V'tm}), 

independently aeross all trees. Beeause the townships overlap a limited number of grid eells, most 
of the 'iptii - ■ ■ 1 i’tm values are zero. 

Using the latent variable representation, we have that Wtjp ~ N(Q!p(sctj), 1) ^r tree j in town¬ 
ship t. In updating the other parameters in the model during the MCMC (speeifieally the a values), 
we eondition on the eurrent values, {ctj}, whieh provides a “soft” (i.e., probabilistie) assignment 
of trees to grid eells that respeets both the known township in whieh the trees oeeurred and the 
uneertainty in whieh grid eells the trees oeeurred. 

Note that this prior represents the loeation of eaeh tree in a township as being independent 
of the other trees; this is somewhat unrealistie beeause it does not represent our knowledge that 
the trees in a township would be distributed somewhat regularly aeross the area of the township 
beeause the witness trees were used to indieate property boundaries. 


3.6 Computation 

The MeCulloeh and Rossi ( 1994h representation is eonvenient for MCMC sampling, partieularly 
in this high-dimensional spatial eontext, as it allows us to draw from the posterior eonditional 
distributions of the Wijp variables (these distributions are truneated normal) in elosed form and to 
draw the entire veetor of latent proeess values for eaeh taxon, ap, as a single sample that respeets 
the spatial dependenee strueture for eaeh taxon. 

While the latent variable representation provides great advantages in the MCMC sampling 
for eaeh ap eompared to joint Metropolis updates or updating eaeh loeation individually, there is 
still strong dependenee between the hyperparameters, {ap, fip, pp} and the latent proeess values 
(as well as between the latent proeess values and the latent Wijp variables). To address the first, 
we developed a “eross-level” joint updating strategy for the CAR model in whieh we propose 
(f)p = ap,p = 1,..., P, (and for the SPDE model, (f)p G {pp, {ap, pp)}) via a Metropolis-style 
random walk and then given the proposed value, cffp, propose ap from its full eonditional distri¬ 
bution given 0* and the latent Wp variables, where Wp is the veetor of all Wijp values for taxon 
p: Wp = {Wijp}, i = 1,... ,m;j = 1,..., n*. This is equivalent to sampling from the marginal¬ 
ized (with respeet to ap) distribution of (pp eonditional on Wp. For these various joint samples of 
hyperparameters and ap, we use adaptive Metropolis sampling ( Shabv and Wellsl 2011 ). 

The full deseription of the MCMC sampling steps is provided in the Appendix. In addition, in 
the latent variable representation, Op{s) never appears explieitly and eannot be ealeulated in elose 
form. Instead we use Monte Carlo integration over Wijp, p = 1,..., P to estimate 6p{si), also 
deseribed in the Appendix. 

The model is implemented in R (IR Core Teaml 120141) with core co mputational calculations 
coded in C-i-i- using the Repp package (lEddelbuettel and Francoisl.l201 ih . We als o make extensive 
use o f sparse matrix representations and algorithms, using the spam package in R (IFurrer and Sain . 
2010h . All code is available on Github, including pre- and post-processing code, at 


https://github.com/PalEON-Project/composition. 
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4 Model comparison 


4.1 Design 

We compared the CAR and SPDE models by holding out data from the fitting process and assessing 
the fit of the model on the held-out data. We used two experimentswith held-out data: 

1. The first experiment used a subregion containing most of Minnesota and a small amount of 
western Wisconsin, defined to be the cells whose x-coordinate was less than 300,000 m (this 
defines a north-south line that approximately goes through Duluth, Minnesota) and hereafter 
referred to as the “Minnesota subregion”. We chose this subregion for evaluation because of 
its high data density, allowing us to experiment with the effects of increasing data sparsity on 
model performance. We held out all data from 95% of the cells in this Minnesota subregion, 
with cells selected at random. This was meant to assess the ability of the model to interpolate 
from a sparse set of cells/townships and mimics the limited data in Illinois and Indiana. 

2. We held out 5% of the trees from all of the trees in the dataset for the midwestern subdomain 
(leaving aside the held-out Minnesota subregion cells). This was meant to assess the ability 
of the model to estimate the composition in cells in which data were available. 


Finally, in a separate sensitivity analysis we instead left out 80% of the cells in Minnesota subre¬ 
gion at random. This variation on the first experiment above was meant to indicate whether our 
model comparison conclusions would be robust as the digitization process for Illinois and Indiana 
progresses and provides us with increasingly dense data. 

There has been extensive work in the statistical literature on good metrics to use to compare the 
predictive ability of models; these metrics are referred to as scoring rules. A general conclusion 
from this work is that predictive distributions should maximize sharpness subject to calibration. 
That is, the predictive distribution should be as narrow as possible while be ing calibrated such 
that the observations are consistent with the distribution ( Gneiting et ah . 2007b . When thinking in 
terms of prediction intervals as summaries of the predictive distribution, we seek intervals that are 
as narrow as possible while still covering the truth the expected proportion (e.g., 95% for a 95% 
prediction interval) of the time. 


Following the suggestions in iGneiting et al.l (l2007h . we considered the following metrics: Brier 
score, log predictive density, mean square prediction error, mean absolute error, and coverage and 
length of prediction intervals. Further details on each are given below. For experiment 1, we define 
Yi = {Yii ,..., Yip} as the count of all trees in held-out cell i and for experiment 2, Yi is the count 
of held-out individual trees in the cell, while ytjp is an indicator variable taking value either 0 or 
1 depending on whether the jth held-out tree in the ith cell is of taxon p. 6ip = Yipjrii is the 
empirical proportion in category p for the Ui held-out trees in cell i. We calculated each of the 
metrics in two ways. First, we used the posterior mean composition estimates (as an evaluation of 
our core predictions), with 6*p(s) being the posterior mean. Second, we averaged the metric over 
the posterior samples (as an evaluation of our full data product, including uncertainty), taking 6*p(s) 
to be an individual MCMC sample and then averaging the metric over all the posterior samples. 


1. Brier score: iGneiting et al.l (120071) suggest this metric, which has been in use for decades. 
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For multi-category as opposed to binary outeomes, this takes the form 

^ m rii P 
2 = 1 j = l p=l 

where n = YlT=i number of held-out trees for a given experiment and j indexes 

aeross held-out trees in cell i. 


2 . 


Log predietive density: This metrie takes the log of the probability density of held-out ob¬ 
servations under the fitted model, Yi ~ Multinom(?T,j, {0i(sj),..., 6'p(si)}), summing on the 
log seale across all of the held-out data. 


While in principle, this metric should be optimal (l&niaiic and Draped. l2014t) . it is very 
sensitive to small predictions near zero ( Gneiting et al.l 12007 ). Even worse, our Monte Carlo 
estimation of 0 used 10000 samples, so in some cases Op{s) = 0. When a tree is present in 
a cell but its corresponding proportion is 0, this gives a log density of — oo, preventing use 

in such cases, but given 


1 


of the metric. As an informal solution to this we set 9p{s) = ^qqqqq 
these issues we treat the log predietive density as a seeondary measure. 


3. (Experiment 1 only) Weighted root mean square prediction error (RMSPE), 


\ 




Pn ^ ^ 

2=1 p=l 


and mean absolute error (MAE) 


^ ra P 

^ ^— ^p(s)| : 

2=1 p=l 

These metries ealeulate the error of the estimated proportions relative to the empirieal pro¬ 
portions based on the held-out trees, averaging over eells and taxa. We weight by the number 
of held-out trees in eaeh eell to aeeount for the greater variability in the empirieal proportions 
in loeations with few held-out trees. 


4. (Experiment 1 only) Coverage and length of 95% prediction intervals for Y^p. We considered 
only eells with at least 50 trees to foeus our assessment on oases where empirieal proportions 
were reasonably oertain and avoid being strongly influeneed by predietive inferenoe for eells 
where observational variability dominates. 

Note that all of the metries exeept ooverage and interval length oan be applied to individual poste¬ 
rior samples and therefore allow us to estimate the posterior probability that one model has a lower 
(better) value of the metrie than the other model by simply ealeulating the proportion of samples 
for whieh the model has a lower value of the metrie. Also note that in addition to allowing eom- 
parison between models the MAE and RMSPE metrics allow one to assess absolute performance 
of each model in predicting composition. 

In our initial exploratory fitting, we notieed that the SPDE model produced boundary effects 
in the predicted eomposition near the edges of the eonvex hull of the observations. To attempt 
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Table 1: Predictive ability based on several predictive metric criteria for the CAR and SPDE spatial 
models when holding out 95% of entire cells of data in Minnesota. Smaller values are better. 



Posterior mean of metric 

Metric of posterior mean predictions 


CAR model 

SPDE model 

Posterior Prob. 

CAR < SPDE 

CAR model 

SPDE model 

Brier 

0.819 

0.844 

0.98 

0.738 

0.733 

Negative Log Density 

466325 

510383 

1.00 

394003 

394554 

Mean Absolute Error 

0.0364 

0.0383 

0.98 

0.0275 

0.0269 

Root Mean Square Eiror 

0.0897 

0.0960 

0.97 

0.0647 

0.0627 


Table 2: Coverage and length of prediction intervals for the CAR and SPDE spatial models when 
holding out 95% of entire cells of data in Minnesota. Coverage values near 0.95 are optimal, while 
shorter intervals are better. _ 



CAR model 

SPDE model 

Coverage 

0.977 

0.978 

Mean Interval Eength 

0.129 

0.142 

Median Interval Eength 

0.037 

0.033 


to alleviate this, we added a buffer zone with a width of six grid cells around our entire original 
domain, but note that the boundary effects were still evident even after inclusion of the buffer. Eor 
the model comparison, we included this buffer for both the SPDE and CAR models. 

We ran each model for 150,000 iterations. After discarding 25,000 iterations for bum-in, we 
retained a posterior sample of 250 subsampled iterations - we use a subsample instead of the full 
125,000 post-burn-in iterations to reduce post-processing computations and storage needs. 

4.2 Results 

Here we summarize the results of our analyses that inform the choice between the CAR and SPDE 
models. 

Full cell hold-out experiment Eor Experiment 1 (full cells held out) for cells in the Minnesota 
subregion held out of the fitting process, the CAR model outperforms the SPDE model based 
on the posterior distribution over the predictive metric values (Tabled]). Eor the posterior mean 
predictions, the SPDE model appears to outperform the CAR model to a lesser degree, but we do 
not have any uncertainty estimates for this comparison. Coverage and interval lengths are similar 
between the two models (Table O. Erom a practical perspective, based on the difference in mean 
absolute error, the differences between the models are small (Tabled])- 

The results for the variation on Experiment 1 in which the proportion of cells that are held out 
decreases from 95% to 80% show that the SPDE model generally outperforms the CAR model, but 
again differences from a practical perspective, based on mean absolute error, are limited (Tables 
M. 
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Table 3: Predictive ability based on several predictive metric criteria for the CAR and SPDE spatial 
models when holding out 80% of entire cells of data in Minnesota. Smaller values are better. 



Posterior mean of score 

Score of posterior mean predictions 


CAR model 

SPDE model 

Posterior Prob. 

CAR < SPDE 

CAR model 

SPDE model 

Brier 

0.773 

0.765 

0.10 

0.710 

0.710 

Negative Log Density 

355928 

353987 

0.25 

311525 

311902 

Mean Absolute Error 

0.0309 

0.0296 

0.10 

0.0226 

0.0223 

Root Mean Square Eiror 

0.0763 

0.0739 

0.02 

0.0533 

0.0530 


Table 4: Coverage and length of prediction intervals for the CAR and SPDE spatial models when 
holding out 80% of entire cells of data in Minnesota. Coverage values near 0.95 are optimal, while 
shorter intervals are better. _ 



CAR model 

SPDE model 

Coverage 

0.981 

0.972 

Mean Interval Eength 

0.112 

0.103 

Median Interval Eength 

0.028 

0.022 


Individual tree hold-out experiment In Experiment 2 (individual trees held out), we have ev¬ 
idence (posterior probability of 0.93) that the SPDE model is better based on the Brier score, but 
the Brier score values for the two models are numerically almost the same (Table [5]). 

Choice of spatial model The differences between models are not consistent across the various 
comparisons, so there is not a clear choice. In our final data product we use the CAR model, 
for three reasons. Eirst, the CAR model has modestly better performance when data are sparse, 
as is still the case for Illinois and Indiana. Second, the model is simpler and easier to explain, 
and computations can be done more quickly. Third, predictions from the SPDE model showed 
boundary effects, with some taxa showing non-negligible posterior mean values at the edges of 
the domain, well away from where the taxa were present in the empirical data. This included 
non-negligible values within (but near the edge of) the convex hull of locations with data. 


Table 5: Predictive ability based on several predictive metric criteria for the CAR and SPDE spatial 
models when holding out 5% of trees. Smaller values are better. 



Posterior mean of metric 

Metric of posterior mean predictions 


CAR model 

SPDE model 

Posterior Prob. 

CAR < SPDE 

CAR model 

SPDE model 

Brier 

0.662 

0.661 

0.07 

0.657 

0.657 

Negative Eog Density 

51757 

51626 

0.01 

50705 

50736 
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5 Data product 


The final data product is a dataset that contains 250 posterior samples of the proportions of each of 
the 23 tree taxa at each grid cell in the states in our domain of the northeastern United States. 

For this final data product, we ran the model using the CAR specification with all of the data 
(including the data held out in the model comparison analyses) for 150,000 iterations with the 
same bum-in and subsampling details as described in Section IH Based on graphical checks and 
calculation of effective sample size values, mixing was generally reasonable, but for some of the 
hyperparameters was relatively slow, particularly for less common taxa. Despite this, mixing for 
the variables of substantive interest - the proportions - was good, with effective sample sizes for 
the final product generally near 250. 

Maps of estimated composition for the full domain for several taxa of substantive interest illus¬ 
trate the results, contrasting the raw data proportions, the posterior means, and posterior standard 
deviations as pointwise estimates of uncertainty (Fig. O. We also present the posterior means for 
all 23 taxa (Fig. [3]). 

The data product is publicly available at the NIS Data Portal under the CC BY 4.0 license as 
version 1.0 as of January 2016 (IPaciorek et al.l. 120161) . The product is in the form of a netCDF-4 
file, with dimensions x-coordinate, y-coordinate, and MCMC iteration. There is one variable per 
taxon. In addition, dynamic visualizations of the product using the Shiny tool are available at 
https://www3.nd.edu/~paleolab/paleonproject/maps. The PalEON Project (in particular the first 
author) will continue to maintain this product, releasing new versions as additional data in Illinois, 
Indiana and Ohio are digitized. Note that digitization of data from Illinois and Indiana is ongoing, 
and digitization of additional data from Ohio is planned as well. As a result, at some point we 
expect to have complete data for the mid western half of the domain. 


6 Discussion 


In the parts of the modeled region with spatially complete data (in particular, Minnesota, Wis¬ 
consin, and Michigan), the statistical estimates of forest composition closely match the patterns 
apparent in the raw data (Fig. [I]), as expected. In these areas, the estimated tree composition 
from the model has the advantage of downweighting unusual or outlier values in the empirical pro¬ 
portions of individual grid cells, which are likely due to stochastic sampling variation within that 
grid cell (compare the first two columns in Fig. Some stochastic variation is expected given 
that, even in the mos t spatially co r nplete regions, each grid cell contains an average of 124 trees 
(120-135 is typical) (iGoring et al.l. 12016h and some cells contain many fewer trees. Hence, some 
smoothing of this stochastic variation is appropriate. This smoothing is based on information on 
data from nearby cells, and the estimates from the model reflect the smooth trends in forest com¬ 
position across the spatial domain. A partial cost is that these maps can smooth out sharp ecotones 
or other forms of true spatial heterogeneity, particularly in areas with sparse data (including areas 
with low tree density). For example, the sharp increase in Elm along the Minnesota River (Eig. [2l 
first column) likely represents a real ecotone in the settlement-era vegetation. Vegetation gradients 
and eco tones were shyper i n the settlement-era forests in the upper Midwest than in contemporary 
forests ( Goring et ah . 2016h . and the modeled estimates may partially obscure this change. Users 
interested in using the original unsmoothed data are directed to the data product described earlier 
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Figure 2: Empirical proportions from raw data (column 1), predictions in the form of posterior 
means (column 2) and uncertainty estimates in the form of posterior standard deviations - repre¬ 
senting standard errors of prediction (column 3) for select taxa. 
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fitted proportions 

0 . 00 - 0.01 
0.01 - 0.05 
0 . 05 - 0.10 
0 . 10 - 0.15 
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■ 0.60 - 0.80 
0.80 - 1.00 


Figure 3: Predictions (posterior means) for all taxa over the entire domain. 
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( Goring and University of Wisconsinl 2016h . Additional investigation of other statistieal represen¬ 
tations to better eapture sharp gradients is of interest, in partieular nonstationary spatial models 
and use of eovariates. Potential e nvironmental eovariates inelude soils, firebreaks, and topography 
( Grimm . 1984 : Shea et ah . 2014h . Here, however, we ehose to limit our model to be exelusively 
a funetion of spatial distanee in order to avoid dependenee on the environmental drivers of pre¬ 
settlement forest eomposition that might lead to eireular reasoning in subsequent inferenees based 
on this dataset. Use of eovariates eould also lead to predietion that a taxa is present well beyond 
its range boundary in plaees where data are sparse. 

A key advanee of this work over prior reeonstruetions of settlement-era vegetation lies in the 
estimates of uneertainty aeross the spatial domain. These estimates of uneertainty inelude the sam¬ 
pling uncertainty within grid cells (as do the within-grid cell estimates of uncertainty available 
from the raw proportions), but, because this is a spatial model, predictions and their associated 
uncertainty estimates are also informed by the information content of nearby cells. The maps of 
standard errors across species (Fig. |2l third column) highlight the advantages of this approach in 
areas of high data coverage (Minnesota, Wisconsin, Michigan) and in areas of sparse coverage 
(e.g., Illinois, Indiana, parts of Ohio). Where there are not large gaps in the data, the model pro¬ 
vides low and fairly smooth estimates of uncertainty. Uncertainty is generally higher in the eastern 
subdomain than in the areas of the midwestern subdomain with high data coverage because of 
missing townships and lower sampling density even in townships with data. In areas of sparse 
coverage and in areas with low tree density (e.g., southwestern Minnesota), the standard error of 
our estimates increases appropriately. Nevetheless, these uncertainties surround reasonable esti¬ 
mates of trends in composition. For example, the model does a good job of capturing the oak 
ecotone in Indiana and Illinois, representing a shift from oak savannas and woodlands to closed 
mesic forests (Fig. Experiment 1 showed that both models predicted composition at cells with 
no data reasonably well, mimicking the case of sparsely sampled data and giving confidence in the 
broad spatial patterns predicted in more poorly sampled regions, particularly those with regular, 
but sparse sampling that mimic the experiment (Illinois and Indiana, but not Ohio). The apparent 
blockiness of uncertainty estimates in a few places such as Ohio is caused by spatial gaps and 
variations in sampling resolution. Absolute uncertainty generally increases with abundance for all 
taxa (Fig. [2l column 3). 

The exploration of alternative approaches to spatial modeling of composition showed similar 
results for the SPDE and CAR models, both in terms of prediction accuracy and performance of 
prediction intervals. Small differences among the various metrics of goodness of fit favored each 
model in turn, but applied users of the models would find little pragmatic difference affecting 
scientific inference. Ultimately, we slightly favor the CAR model, because it avoids the boundary 
effects apparent in the SPDE model at the edges of the domain. 

The models presented here estimate only the relative abundance of tree taxa, which does not 
directly tell us about tree density or other aspects of vegetation structure. This becomes a particular 
limitation for interpreting vegetation where tree s become sparse at the prairie-forest transition from 
northern Minnesota through southern Illinois (ITranseaul Il935h . Our model (correctly) predicts 
that tree composition there is dominated by oak, but this is less useful considering the sparseness 
of trees. This limitation can be addressed by developing estimates of absolute abundance (e.g., 
biomass) rather than compositional estimates. A gridded dataset of biomass, sten i density, and 
basal area is already available for Minnesota, Wisconsin, and northern Michigan ( Goring et al.l 


2016h . based on the PES data. An extension to southern Michigan, Illinois, and Indiana is planned. 
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We are eurrently developing statistieal estimates of biomass for Minnesota, Wiseonsin, and Miehi- 
gan using a statistieal model applied to the gridded biomass dataset, with extension to Illinois and 
Indiana planned. We also plan to estimate stem density and basal area using a similar approaeh to 
that used for biomass. 
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7 Appendix 

7.1 MCMC details 

Ui 

Define Wip = ^ijp the average of the W values for the pth taxon in the fih grid cell and 

i=i 

Wp = {wip}, i = 1,... ,m. Let A be a diagonal matrix where An is the number of trees in the ith 
grid cell. When there are no trees in a grid cell, Wip = 0 and An = 0. For the township data, at 
each iteration, based on the current values of the grid cell membership variables, {ctj}, trees are 
aggregated into grid cells and the calculations above can then be carried out. 

The conditional distribution for Wijp given the other unknowns in the model and the data is as 
follows. Let TN(a, b, /i, r^) denote the truncated normal distribution with mean parameter p and 
variance parameter r^, truncated below by a and above by b. 


Wi 


ijp 


tn( max Wijp*, oo, ay^■ (s*), l), 

p*¥^yij 

TN( - oo,Wijy^.,ap{si), l), 


if P = Vij 
if P ^ Vij 


In essence, the truncation value is determined by the taxon of the jth tree. Lor a given p, the W 
values for all trees in all cells can be sampled in parallel. 

The conditional distribution of ap is 


Q!p ~ N 



-1 

Awp, 




-1 


where Qp = (a^) for the CAR model and yap ■ ^ j Q{pp) for the SPDE model. Lor each 

hyperparameter, fp = logcXp for the CAR model and fp G {/ip, (log cTp, log Pp)} for the SPDE 
model, we sample {0p, ap} jointly, proposing 0p as a random walk and, conditional on the pro¬ 
posed value of fp, sampling ap from the distribution just above. The joint proposal is accepted or 
rejected as a stand^d M etropolis-Hastings proposal, with adaptation of the proposal (co)variance 
( Shabv and Wellsl 201 lb . The proposal distribution for <pp is a normal distribution (bivariate for 
4>p = (log^p, logpp)). 
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For the township-level data, for a given tree j in township t, we draw the latent tree mem¬ 
bership variable, Ctj G m}, from a diserete distribution by normalizing posterior weights, 

{^jJlLtjl,... produced by multiplying the prior weights by a likelihood eontribution, 

Ltji, i = 1, ... ,m. Ltji is the density of the latent Wtji,..., Wtjp values for the given tree 
under the eondition that Ctj = i, namely the produet of independent normal densities, Wtjp ~ 
N(a!p(sj), 1), over p = 1,..., P. Thus the posterior reweights the prior based on how eonsistent 
the eurrent Wtj values for a tree are with the a values for the eandidate grid eells. 


7.2 Estimating Op{s) via Monte Carlo integration 

In the latent variable representation, 9p{s) never appears explieitly and eannot be ealeulated in 
elosed form. Instead we use Monte Carlo integration over Wtjp, p = 1,..., P to estimate 9p{si). 
The quantity 9p{si) = Vxoh{Wijp = maxlTj^p*) defines the probability of taxon p at grid eell i. 

p* 

This requires one to ehoose the number of Monte Carlo samples, whieh we set at 10000, effeetively 
sampling 10000 hypothetieal trees and estimating the probabilities of the different taxa in the 
population from the empirieal proportions in this sample of trees. For eaeh of the saved MCMC 
samples, A; = 1,... iC, we estimate 9p’^\si) numerieally. Speeifieally, for f = 1,..., 10000 samples 
(i.e., hypothetieal trees), we independently draw 






,P 


and estimate using 




1 

10000 


10000 


t=i 



,P 


where 1 (■) is the indieator funetion that evaluates to 1 if the expression is true and 0 if false. In other 
words, we ealeulate the proportion of times that the maximum of Wup, p = 1,..., P eorresponds 
to taxon p. Considering 9p’^\si), k = 1,..., iC, we have a sample from the posterior of 9p{si). 
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